Before a full server recovery is attempted, or before service packs,
patches, updates, or other fixes are applied in an attempt to recover
the server, a full backup should be performed on the server.
At first, it might
seem unnecessary to back up a server that isn’t working properly, but
during the problem-solving and debugging process, it is quite possible
for a server to end up in even worse shape after a few updates and fixes
have been applied. The initial problem might have been that a single
mailbox couldn’t be accessed, and after some problem-solving efforts,
the entire server might be inaccessible. A backup provides a rollback to
the point of the initial problem state. When making changes in an
attempt to fix a server, you always want a way to roll back a change if
it turns out to make the situation worse. When the backup is complete,
verify that it is valid: confirm that no open files were skipped
during the backup process or that any skipped files were captured by a
separate open-file backup process. This way, you will always
have the ability to return to your starting point in case you need to
try a different method to fix the server.
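The open-file check described above can be sketched as a simple set comparison. This is a minimal illustration, not a feature of any backup product: the function name and the file names are hypothetical, and it assumes you can export a list of files from the server and from each backup job.

```python
def find_unprotected_files(source_files, main_backup, open_file_backup=()):
    """Return files present on the server but absent from every backup set."""
    protected = set(main_backup) | set(open_file_backup)
    return sorted(set(source_files) - protected)


# Hypothetical file names, for illustration only.
source = ["db1.edb", "E00.log", "E00.chk"]
main = ["E00.log", "E00.chk"]   # db1.edb was open and was skipped
open_file = ["db1.edb"]         # captured by the separate open-file backup

missing = find_unprotected_files(source, main, open_file)
print(missing or "all files covered")
```

Any file name the function returns is in neither backup set and would be lost in a rollback.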
Caution
When performing any
recovery of an Exchange server or resource, be careful what you delete,
modify, or change. As a rule of thumb, never delete objects that are
known throughout the directory; because each object carries a unique
identifier, a deleted object cannot simply be re-created and restored
under its original identity. As an example, if you plan to
restore an entire server from tape, you do not want to first delete the
server and then add the server back during the restoration process. The
restoration process requires the existence of the old server in the
directory. Deleting the server object and then adding the object again
later gives the object a completely different globally unique identifier
(GUID). Even though you restore the entire Exchange server from tape,
the ID of the server and all the objects in the server will be
different, making it more difficult to recover the server. Other
replicable objects that should not be deleted include public folders,
public folder trees, groups, and distribution lists.
Validating Backup Data and Procedures
Another
important task that should be done before doing any maintenance,
service, or repairs on an Exchange server is to validate that a full
backup exists on the server, test the condition of the backup, and then
secure the backup so that it is safe. Far too many organizations proceed
with risky recovery procedures, believing that they have a fallback
position by restoring from tape, only to realize that the tape backup is
corrupt or that a complete backup does not exist. Equally important is
to be sure that the tape you might need is actually onsite. Many
companies send tapes offsite for storage. If you depend on a particular
backup tape for your rollback, be sure it is readily accessible.
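The three conditions above (a complete backup exists, it has been tested, and the media is within reach) can be treated as a gate before any risky procedure. The sketch below is one hypothetical way to record that gate; the class and field names are illustrative assumptions, and the values would come from your own checks rather than from any Exchange or backup-software API.

```python
from dataclasses import dataclass


@dataclass
class BackupStatus:
    full_backup_exists: bool
    restore_test_passed: bool
    media_onsite: bool


def rollback_blockers(status):
    """List the reasons a tape rollback could not be relied on."""
    checks = [
        (status.full_backup_exists, "no complete backup exists"),
        (status.restore_test_passed, "backup has not been verified as readable"),
        (status.media_onsite, "backup media is stored offsite"),
    ]
    return [reason for ok, reason in checks if not ok]


# Hypothetical status: the backup is good, but the tape was sent offsite.
blockers = rollback_blockers(BackupStatus(True, True, False))
print(blockers or "safe to proceed")
```

An empty list means the fallback position is real; anything else should be resolved before maintenance begins.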
If the administrators
of the network realize that there is no clean backup, the procedures
taken to recover the system might be different than if a backup had
existed. If a full backup exists and is verified to be in good
condition, the organization has an opportunity to restore from tape if a
full restore is necessary. This requirement is somewhat lessened in an
environment where Database Availability Groups (DAGs) are used because
those configurations can tolerate the failure of a system should
something go wrong during an upgrade or maintenance.
Steps can be taken to help an organization prepare a recoverable
environment. These include documenting server states and
conditions, performing specific backup procedures, and setting up
features in Exchange Server 2010 that simplify the
restoration process. By maintaining these processes and performing
regular test restores, a company can be confident that it can
recover quickly from a disaster. Most notable is the use of
Database Availability Groups to provide redundant mailbox services.
Because the failover to another replica within a DAG is essentially
transparent to the end users, it is considered a best practice with
Exchange Server 2010 to use DAGs.
Documenting the Exchange Server Environment
Key to the success of
recovering an Exchange server or an entire Exchange Server environment
is having documentation on the server configurations. Having specific
server configuration information documented helps to identify which
server is not operational, the routing of information between servers,
and, ultimately, the impact that a server failure or server recovery
will have on the rest of the Exchange Server environment. By having a
complete understanding of the Exchange Server environment as a whole, an
administrator can often bring up temporary services to alleviate a
failure and give themselves more time to fix the issue and determine the
root cause.
Note
A utility called ExchDump can assist an administrator with baselining and improving the environment. Use ExchDump to export and document a server’s configuration. The ExchDump utility can be downloaded from the Microsoft Exchange Server download page at www.microsoft.com/exchange/downloads/2003/default.mspx.
Although this utility was
originally written for Exchange Server 2003, it works fine for
extracting the same information from an Exchange Server 2010 server.
Some of the items that should be documented include the following:
Server name
Server roles held
Version of Windows on servers (including service pack)
Version of Exchange Server on servers (including service pack)
Organization name in Exchange Server
Site names
Database names
Location of databases
Size of databases
When database maintenance was last run
Public folder tree name
Replication process of public folders
Security delegation and administrative rights
Names and locations of global catalog servers
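One way to keep the items listed above consistent from server to server is to store them as a structured record. The following is a minimal sketch that covers part of the list; the field names and every sample value are hypothetical, and nothing here is gathered from Exchange automatically.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DatabaseRecord:
    name: str
    location: str
    size_gb: float
    last_maintenance: str  # date the maintenance last ran, as text


@dataclass
class ServerRecord:
    server_name: str
    roles: list
    windows_version: str
    exchange_version: str
    organization_name: str
    site_name: str
    databases: list = field(default_factory=list)
    global_catalog_servers: list = field(default_factory=list)


# Hypothetical sample values, for illustration only.
record = ServerRecord(
    server_name="EX01",
    roles=["Mailbox", "Hub Transport"],
    windows_version="Windows Server 2008 R2 SP1",
    exchange_version="Exchange Server 2010 SP1",
    organization_name="ExampleOrg",
    site_name="HQ",
    databases=[DatabaseRecord("MDB01", r"D:\Databases\MDB01", 250.0, "2010-06-01")],
    global_catalog_servers=["DC01"],
)

# asdict() converts nested dataclasses, so the record serializes directly.
print(json.dumps(asdict(record), indent=2))
```

Keeping the output as a file beside the backup documentation makes it easy to compare one server's state over time.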
Documenting the Backup Process
To simplify a
restore of an Exchange Server environment, it is important to start with
a clean backup. A clean backup is performed when the proper backup
process is followed. Create a backup process that works, document the
step-by-step procedures to back up the server, follow the procedures
regularly, and then validate that the backups have been completed
successfully.
Also, when
configurations change, the backup process and system configurations
should be documented and validated again, to make sure that the backups
are completed properly.
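A simple way to notice that a configuration has changed since the backup process was last validated is to fingerprint the documented configuration and compare it with the fingerprint recorded at validation time. This is a hedged sketch of that idea only; the function names and the sample configuration are assumptions, not part of any Exchange tooling.

```python
import hashlib
import json


def config_fingerprint(config):
    """Hash a configuration dictionary in a key-order-independent way."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_revalidation(current_config, validated_fingerprint):
    """True if the configuration differs from the one last validated."""
    return config_fingerprint(current_config) != validated_fingerprint


# Hypothetical documented configuration at the last backup validation.
baseline = {"roles": ["Mailbox"], "databases": ["MDB01"]}
validated = config_fingerprint(baseline)

# A database is added later, so the backup process should be re-checked.
current = {"roles": ["Mailbox"], "databases": ["MDB01", "MDB02"]}
print(needs_revalidation(current, validated))
```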
Documenting the Recovery Process
An
important aspect of recovery feasibility is knowing how to recover from
a disaster. Just knowing what to back up and what scenarios to plan for
is not enough. Restore processes should be created and tested to ensure
that a restore can meet service level agreements (SLAs) and that the
staff members understand all the necessary steps.
When a process is
determined, it should be documented, and the documentation should be
written to make sense to the desired audience. For example, if a failure
occurs in a satellite office that has only marketing employees and one
of them is forced to recover a server, the documentation needs to be
written so that it can be understood by just about anyone. If the
information technology (IT) staff will be performing the restore, the
documentation can be less detailed because it can assume a certain level
of knowledge and expertise with the server product. The first paragraph of
any document related to backup and recovery should be a summary of what
the document is used for and the level of skill necessary to perform the
task and understand the document.
The recovery process for an Exchange Server problem should focus
not only on getting the entire Exchange server back up and
operational, but also on smaller steps that might help
minimize downtime. As an example, if an Exchange server has failed,
instead of trying to restore 10TB of mail back to the server, which can
take hours, if not days, to complete, an organization can choose to
restore just the user Inboxes, calendars, and contacts. After a faster
system recovery of core information on a server, the balance of the
information can be restored over the next several hours.
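The difference between a full restore and a staged restore is easy to estimate with simple arithmetic. The sketch below uses the 10TB figure from the example above; the restore throughput and the size of the core data are assumptions chosen for illustration, and real rates depend on the media and the hardware.

```python
def restore_hours(size_gb, throughput_mb_per_s):
    """Estimate restore duration in hours (decimal units: 1 GB = 1000 MB)."""
    seconds = size_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


THROUGHPUT_MB_PER_S = 100   # assumed sustained restore rate
FULL_STORE_GB = 10_000      # the 10TB example from the text
CORE_DATA_GB = 500          # assumed size of Inboxes, calendars, and contacts

full = restore_hours(FULL_STORE_GB, THROUGHPUT_MB_PER_S)
core = restore_hours(CORE_DATA_GB, THROUGHPUT_MB_PER_S)
print(f"Full restore: {full:.1f} h, core data first: {core:.1f} h")
```

Under these assumptions, the full restore takes roughly 28 hours, whereas users regain their core data in under an hour and a half. Comparing such estimates against the SLA shows whether a staged restore is needed.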
The other advantage of
having a properly documented restore procedure is that it greatly
reduces the chances of human error occurring during a restore.
Recovering a failed server while hundreds or possibly thousands of email
users are affected is a stressful situation. This isn’t the time to
learn how to perform a restore. The goal in this situation is for the
administrator to follow a clearly documented and well-tested process to
ensure that no steps are missed and that no information is entered
incorrectly. Having well-documented steps can greatly reduce the stress
of this situation and increase the chances of a successful restore.
Even if an environment uses DAGs as its primary form of disaster
recovery, there should still be a documented procedure for what to do
when a DAG member fails. Although the rebuild of redundant systems can
be delayed, the longer the delay, the more data will have to be
incrementally reseeded and the longer the company remains at higher
risk should other replicas fail.
Including Test Restores in the Scheduled Maintenance
Part of a successful
disaster recovery plan involves periodically testing the restore
procedures to verify accuracy and to test the backup media to ensure
that data can actually be recovered.
Most organizations or administrators assume that if the backup software
reports “Successful,” the backup is good and data can be recovered. If
special backup considerations are not addressed, a backup reported as
successful might still not contain everything necessary to restore a
server if data loss or software corruption occurs.
Restores of
file data, application data, and configurations should be performed as
part of a regular maintenance schedule to ensure that the backup method
is correct and that disaster recovery procedures and documentation are
current. Such tests also should verify that the backup media can be read
from and used to restore data. Adding periodic test restores to regular
maintenance intervals ensures that backups are successful and
familiarizes the administrators with the procedures necessary to recover
so that when a real disaster occurs, the recovery can be performed
correctly and efficiently the first time.
These test
restores should occur in a lab environment in which end users won’t be
affected. The restores should vary in type, testing single mailbox
restores, complete server restores, and full site restores in which even
domain controllers might need to be restored from scratch. This helps
ensure that staff members are comfortable with the process and will have
no problem performing a restore in production should the occasion ever
arise.
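Varying the restore type is easier to sustain if the rotation is planned in advance. The following sketch assigns the three restore types named above to successive maintenance windows in turn; the window labels and the function name are hypothetical.

```python
from itertools import cycle

RESTORE_TYPES = ["single mailbox", "complete server", "full site"]


def plan_test_restores(windows):
    """Pair each maintenance window with the next restore type in rotation."""
    return list(zip(windows, cycle(RESTORE_TYPES)))


# Hypothetical maintenance windows, for illustration only.
for window, restore_type in plan_test_restores(["Q1", "Q2", "Q3", "Q4"]):
    print(f"{window}: {restore_type} restore in the lab")
```

Because the rotation wraps around, every restore type is exercised before any one is repeated.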